feat: add Qwen3.5 MoE calibration module by Sehyo · Pull Request #2383 · vllm-project/llm-compressor

Sehyo · 2026-02-18T11:12:37Z

Summary

- Add CalibrationQwen3_5MoeSparseMoeBlock calibration module that unfuses Qwen3.5's 3D fused expert parameters into individual Qwen3_5MoeMLP modules with nn.Linear layers, enabling NVFP4 quantization of expert weights.
- Register the module in modeling/__init__.py.
- Add NVFP4 quantization example script for Qwen/Qwen3.5-397B-A17B.

Details

Qwen3.5 MoE (Qwen3_5MoeSparseMoeBlock) stores all expert weights in fused 3D nn.Parameter tensors (gate_up_proj: [num_experts, 2*intermediate, hidden], down_proj: [num_experts, hidden, intermediate]). The calibration module unfuses these into individual MLP modules so targets="Linear" can match and quantize them.

The implementation follows the same pattern as CalibrateQwen3VLMoeTextSparseMoeBlock with is_permanent=True, and includes disable_onloading() for safe CPU access to offloaded parameters on large models.
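
A minimal sketch of the unfusing step described above, not the PR's actual code. It assumes the first half of gate_up_proj along dim 1 is the gate projection and the second half is the up projection, and that the expert activation is SiLU. ExpertMLP and unfuse_experts are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """Hypothetical stand-in for one unfused expert built from nn.Linear layers."""

    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ASSUMPTION: SiLU-gated MLP
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


def unfuse_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor) -> nn.ModuleList:
    """Split fused 3D expert tensors into one ExpertMLP per expert.

    gate_up_proj: [num_experts, 2 * intermediate, hidden]
    down_proj:    [num_experts, hidden, intermediate]
    """
    num_experts, two_intermediate, hidden = gate_up_proj.shape
    intermediate = two_intermediate // 2
    experts = nn.ModuleList()
    for e in range(num_experts):
        mlp = ExpertMLP(hidden, intermediate)
        # ASSUMPTION: gate rows come first, up rows second
        gate_w, up_w = gate_up_proj[e].chunk(2, dim=0)
        with torch.no_grad():
            mlp.gate_proj.weight.copy_(gate_w)
            mlp.up_proj.weight.copy_(up_w)
            mlp.down_proj.weight.copy_(down_proj[e])
        experts.append(mlp)
    return experts


# Tiny shapes so the check runs instantly on CPU
torch.manual_seed(0)
E, I, H = 4, 8, 16
gate_up = torch.randn(E, 2 * I, H)
down = torch.randn(E, H, I)
experts = unfuse_experts(gate_up, down)

# Reference computation on the fused tensors for expert 1
x = torch.randn(3, H)
gate, up = F.linear(x, gate_up[1]).chunk(2, dim=-1)
fused_out = F.linear(F.silu(gate) * up, down[1])

num_linear = sum(isinstance(m, nn.Linear) for m in experts.modules())
print(num_linear, torch.allclose(experts[1](x), fused_out, atol=1e-5))
```

After unfusing, every expert weight lives in an nn.Linear module, which is what lets a targets="Linear" recipe match it.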

gemini-code-assist · 2026-02-18T11:12:53Z

Summary of Changes

This pull request introduces a specialized calibration module for Qwen3.5 Mixture-of-Experts (MoE) models, designed to facilitate efficient NVFP4 quantization of their expert weights. By dynamically restructuring the MoE block to expose individual expert layers as standard linear modules, it enables the application of fine-grained quantization techniques. A new example script demonstrates this process, ensuring broader compatibility and optimized performance for these large language models.

Highlights

- Qwen3.5 MoE Calibration Module: Introduced CalibrationQwen3_5MoeSparseMoeBlock to enable NVFP4 quantization for Qwen3.5 MoE models.
- Expert Parameter Unfusing: This new module unfuses Qwen3.5's 3D fused expert parameters into individual nn.Linear layers, making them targetable for quantization.
- Module Registration: The new calibration module has been registered in modeling/__init__.py.
- NVFP4 Quantization Example: An example script (qwen3_5_moe_example.py) was added to demonstrate NVFP4 quantization for the Qwen/Qwen3.5-397B-A17B model.

Changelog

examples/quantization_w4a4_fp4/qwen3_5_moe_example.py
- Added a new example script for NVFP4 quantization of Qwen3.5 MoE models.
src/llmcompressor/modeling/__init__.py
- Imported CalibrationQwen3_5MoeSparseMoeBlock.
- Registered the new Qwen3.5 MoE calibration module.
src/llmcompressor/modeling/qwen3_5_moe.py
- Added CalibrationQwen3_5MoeSparseMoeBlock to unfuse 3D expert parameters into individual nn.Linear modules for quantization.
- Implemented SequentialQwen3_5MoeExperts to manage the unfused expert layers.
- Included logic to handle offloaded parameters safely during unfusing.

github-actions · 2026-02-18T11:13:26Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

gemini-code-assist

Code Review

This pull request introduces a calibration module for Qwen3.5 MoE models, enabling NVFP4 quantization. The changes include the core module implementation, its registration within the modeling package, and a comprehensive example script demonstrating its usage on a large-scale model. The implementation correctly unfuses expert weights into individual nn.Linear layers, which is crucial for quantization. The approach of using disable_onloading to handle large model weights on the CPU is well-considered. I have identified one potential issue in the forward pass logic that could lead to errors for MoE models configured with top_k=1, and I have provided a suggestion to address it.

src/llmcompressor/modeling/qwen3_5_moe.py

Sehyo · 2026-02-18T13:31:11Z

Requesting a review, or alternatively the ready and enhancement tags.
@dsikka @kylesayrs

Sehyo · 2026-02-18T19:33:59Z

Quantized version with this PR:
https://huggingface.co/Sehyo/Qwen3.5-397B-A17B-NVFP4

dsikka

This looks really good - thank you!

mergify · 2026-02-18T19:58:37Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

aabbccddwasd · 2026-02-19T03:02:40Z

keep getting RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

Sehyo · 2026-02-19T15:25:22Z

Is this an error from vLLM?
Can you please share your GPU setup, vLLM version and the trace?
Is it the same problem as in vLLM issue #33544?

Sehyo · 2026-02-19T16:06:44Z

I have detected an issue in the current upstream version of vLLM which causes the Qwen3.5 NVFP4 quant to fail.

In the Qwen3.5 Gated Delta Net layers, some projections are fused / merged:
in_proj_qkvz = Q + K + V + Z
in_proj_ba = B + A

and vLLM does the fusing like this:
("in_proj_qkvz", "in_proj_qkv", (0, 1, 2)),
("in_proj_qkvz", "in_proj_z", 3),
("in_proj_ba", "in_proj_b", 0),
("in_proj_ba", "in_proj_a", 1),

This assumes plain weight tensors that can be concatenated. But the NVFP4 format stores weights as weight_packed (4-bit packed), so the fused weights are garbage.

I am currently trying to write a fix for this. If I get it working, I will submit a PR to the vLLM repo as well.
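
The mapping tuples quoted above can be read as (fused parameter name, checkpoint name, shard id). The sketch below only illustrates that name resolution step. resolve_fused_name is a hypothetical helper, not vLLM's loader, the example keys are made up, and it says nothing about how the tensors themselves are merged.

```python
# Tuples copied from the comment above: (fused name, checkpoint name, shard id)
STACKED_PARAMS_MAPPING = [
    ("in_proj_qkvz", "in_proj_qkv", (0, 1, 2)),
    ("in_proj_qkvz", "in_proj_z", 3),
    ("in_proj_ba", "in_proj_b", 0),
    ("in_proj_ba", "in_proj_a", 1),
]


def resolve_fused_name(checkpoint_key: str):
    """Return (fused_key, shard_id), or (checkpoint_key, None) if nothing is fused.

    Matching is done on whole dot-separated components so that "in_proj_b"
    never matches inside an already fused "in_proj_ba".
    """
    parts = checkpoint_key.split(".")
    for fused_name, ckpt_name, shard_id in STACKED_PARAMS_MAPPING:
        if ckpt_name in parts:
            fused_parts = [fused_name if p == ckpt_name else p for p in parts]
            return ".".join(fused_parts), shard_id
    return checkpoint_key, None


# Hypothetical checkpoint keys for illustration
for key in [
    "model.layers.0.linear_attn.in_proj_z.weight",
    "model.layers.0.linear_attn.in_proj_qkv.weight_packed",
    "model.layers.0.mlp.gate.weight",
]:
    print(key, "->", resolve_fused_name(key))
```

The same lookup applies whether the key ends in weight or weight_packed. What differs for a packed format is whether the shards can simply be concatenated afterwards.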

dsikka · 2026-02-19T16:10:57Z

If we skip quantizing the linear attn layers, won't this issue be resolved?
It seems like you skipped them in your ignore list.

dsikka · 2026-02-19T16:12:57Z

Do you mind adding a test similar to the tests in this folder: https://github.com/vllm-project/llm-compressor/tree/main/tests/llmcompressor/modeling

Sehyo · 2026-02-19T17:50:13Z

If we skip quantizing the linear attn layers, won't this issue be resolved? It seems like you skipped them in your ignore list.

Yes, for those layers it does not matter.
However, I noticed that vLLM's load_weights has no mapping for the unfused checkpoint weights (the Qwen3.5 HF checkpoint stores unfused names: in_proj_qkv.weight, in_proj_z.weight, etc.). Still looking into it.
Edit: Never mind, the issue I was running into was fixed in a commit in the latest vLLM nightly.

Sehyo · 2026-02-19T17:50:22Z

Do you mind adding a test similar to the tests in this folder: https://github.com/vllm-project/llm-compressor/tree/main/tests/llmcompressor/modeling

Sure, will do it!

Sehyo · 2026-02-20T09:52:42Z

@dsikka Tests have been added.

Sehyo · 2026-02-22T15:12:05Z

Review Request

Sehyo · 2026-02-25T22:14:34Z

It came to my attention that the MTP modules are dropped from the quant. I am away until Sunday but can fix it then.

HDCharles · 2026-03-02T15:59:41Z

@JartX I would generally not rely on such specific behavior being maintained when going from a non-quantized to a quantized model. If something like that is needed, I would start by doing AWQ or GPTQ with your use cases as a significant part of the calibration data.

JartX · 2026-03-02T16:06:14Z

@HDCharles
Hi, thanks for your reply. That's what I'm doing with GPTQ, but what I'm seeing is that Qwen3VL and Omni work, while Qwen3.5 35B doesn't. It seems to be making things up in the JSON. For example, the official Qwen3 example works correctly with Omni. That's why I'm confused.
Could it be necessary to add more entries to the ignore list?

HDCharles · 2026-03-02T18:50:30Z

That's a use-case-specific issue. In practice it's assumed that there will be some quantization loss, and there are a variety of techniques you can use to quantize your model so that it performs better on your specific use case. As far as this being what you are doing, it looks like you're calibrating on ultrachat and nemotron. Do those datasets conform to a specific JSON format?
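
One way to act on this suggestion is to add chat samples whose assistant turn is the kind of JSON the deployment expects. The sketch below is illustrative only: record_to_messages, the instruction text and the invoice-like records are all hypothetical, and it only produces the messages layout that ultrachat-style datasets already use.

```python
import json


def record_to_messages(instruction: str, record: dict) -> dict:
    """Wrap one structured record as a two-turn chat sample.

    The assistant turn is the record serialized as JSON, so calibration
    activations are collected on JSON-shaped outputs as well.
    """
    return {
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": json.dumps(record, ensure_ascii=False)},
        ]
    }


# Hypothetical records standing in for real, use-case-specific data
INSTRUCTION = "Extract the invoice fields as JSON."
records = [
    {"invoice_number": "A-001", "total": 1260.0, "currency": "EUR"},
    {"invoice_number": "A-002", "total": 59.0, "currency": "EUR"},
]

json_samples = [record_to_messages(INSTRUCTION, r) for r in records]
print(json_samples[0]["messages"][1]["content"])
```

The resulting list could be wrapped with datasets.Dataset.from_list and concatenated with the other calibration datasets before the chat template is applied.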

JartX · 2026-03-02T19:13:15Z

That's a use-case-specific issue. In practice it's assumed that there will be some quantization loss, and there are a variety of techniques you can use to quantize your model so that it performs better on your specific use case. As far as this being what you are doing, it looks like you're calibrating on ultrachat and nemotron. Do those datasets conform to a specific JSON format?

The new script

import os
import shutil
import torch
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from datasets import concatenate_datasets, load_dataset
from huggingface_hub import snapshot_download
from transformers import AutoProcessor, AutoTokenizer, Qwen3_5MoeForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen3.5-35B-A3B"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-GPTQ-W4A16-G32"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map=None,
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

samples_per_dataset = NUM_CALIBRATION_SAMPLES // 4

ds_ultrachat = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{samples_per_dataset}]",
)
ds_nemotron = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v2",
    split=f"chat[:{samples_per_dataset}]",
)
ds_code = load_dataset(
    "ise-uiuc/Magicoder-Evol-Instruct-110K",
    split=f"train[:{samples_per_dataset}]",
)
ds_hermes = load_dataset(
    "NousResearch/hermes-function-calling-v1",
    split=f"train[:{samples_per_dataset}]",
)

ds_ultrachat = ds_ultrachat.select_columns(["messages"])
ds_nemotron = ds_nemotron.select_columns(["messages"])

def format_magicoder(example):
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }
ds_code = ds_code.map(format_magicoder, remove_columns=ds_code.column_names)

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}
def format_hermes(example):
    messages = [
        {"role": ROLE_MAP[t["from"]], "content": t["value"]}
        for t in example["conversations"]
        if t["from"] in ROLE_MAP
    ]
    return {"messages": messages}
ds_hermes = ds_hermes.map(format_hermes, remove_columns=ds_hermes.column_names)
ds_hermes = ds_hermes.filter(lambda x: len(x["messages"]) > 0)

ds = concatenate_datasets([ds_ultrachat, ds_nemotron, ds_code, ds_hermes])
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )
ds = ds.map(tokenize, remove_columns=ds.column_names)

recipe = GPTQModifier(
    config_groups={
        "group_0": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=32,
                symmetric=True,
                dynamic=False,
            ),
        )
    },
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*mlp\\.shared_expert$",
        "re:.*linear_attn.*",
        "re:.*input_layernorm$",
        "re:.*post_attention_layernorm$",
        "re:.*norm$",
    ],
    bypass_divisibility_checks=False,
    block_size=128,
    dampening_frac=0.01,
    actorder="static",
    offload_hessians=False,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=False,
    pipeline="sequential",
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

cache_dir = snapshot_download(MODEL_ID, allow_patterns=["*.json"])
for filename in [
    "vocab.json",
    "tokenizer_config.json",
    "preprocessor_config.json",
    "video_preprocessor_config.json",
]:
    src = os.path.join(cache_dir, filename)
    dst = os.path.join(SAVE_DIR, filename)
    if os.path.exists(src):
        shutil.copyfile(src, dst)
        print(f"Copied: {filename}")
    else:
        print(f"Not found in cache: {filename}")

@HDCharles
Hmm, no, they don't. I added a JSON dataset. But it doesn't make sense, because in previous models JSON generation worked even without passing specific datasets. For example, it works correctly with the dense Qwen3.5 27B model or Qwen3 VL 30B A3B. It must be something in the MoE that I'm missing.
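
The ignore list in the script above relies on "re:" patterns, so it can help to check which module names they actually cover. This is a rough sketch under two assumptions: that a "re:" entry is applied with re.match against the full module name, and that the module names below (which are hypothetical) resemble the real ones. It does not cover matching by class name.

```python
import re

# Ignore list copied from the script above
IGNORE = [
    "re:.*lm_head",
    "re:visual.*",
    "re:model.visual.*",
    "re:.*mlp.gate$",
    "re:.*embed_tokens$",
    "re:.*shared_expert_gate$",
    "re:.*mlp\\.shared_expert$",
    "re:.*linear_attn.*",
    "re:.*input_layernorm$",
    "re:.*post_attention_layernorm$",
    "re:.*norm$",
]


def is_ignored(module_name: str, ignore: list) -> bool:
    """ASSUMPTION: 're:' entries use re.match; other entries must equal the name."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names for illustration
PREFIX = "model.language_model.layers.0."
for suffix in [
    "mlp.gate",
    "mlp.experts.3.gate_proj",
    "linear_attn.in_proj_qkv",
    "mlp.shared_expert.down_proj",
]:
    print(suffix, is_ignored(PREFIX + suffix, IGNORE))
```

Under these assumptions, "re:.*mlp\\.shared_expert$" matches only the shared_expert container and not its gate_proj, up_proj and down_proj children, so those Linear layers would still be quantized unless they are listed explicitly.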

JartX · 2026-03-03T09:41:28Z

@Sehyo @dsikka @HDCharles

I have been investigating an issue that appears to affect all quantized models capable of running on vLLM main.

Specifically, all quantized versions fail to generate structured_output when the input includes images, regardless of the quantization technique used. This has been observed across the following formats:

- LlmCompressor-AWQ
- LlmCompressor-GPTQ
- GPTQ
- FP8

The failure occurs consistently during the generation phase when multimodal (image) data is present in the prompt.

https://huggingface.co/cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit
https://huggingface.co/QuantTrio/Qwen3.5-35B-A3B-AWQ
https://huggingface.co/btbtyler09/Qwen3.5-35B-A3B-GPTQ-4bit

The last quant code:

import os
import shutil

import torch
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from datasets import concatenate_datasets, load_dataset
from huggingface_hub import snapshot_download
from transformers import AutoProcessor, AutoTokenizer, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen3.5-35B-A3B"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-GPTQ-W4A16-G32"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map=None,
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

samples_per_dataset = NUM_CALIBRATION_SAMPLES // 4

ds_ultrachat = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{samples_per_dataset}]",
)

ds_nemotron = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v2",
    split=f"chat[:{samples_per_dataset}]",
)

ds_code = load_dataset(
    "ise-uiuc/Magicoder-Evol-Instruct-110K",
    split=f"train[:{samples_per_dataset}]",
)

ds_hermes = load_dataset(
    "NousResearch/hermes-function-calling-v1",
    split=f"train[:{samples_per_dataset}]",
)

ds_ultrachat = ds_ultrachat.select_columns(["messages"])
ds_nemotron = ds_nemotron.select_columns(["messages"])

def format_magicoder(example):
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

ds_code = ds_code.map(format_magicoder, remove_columns=ds_code.column_names)

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def format_hermes(example):
    messages = [
        {"role": ROLE_MAP[t["from"]], "content": t["value"]}
        for t in example["conversations"]
        if t["from"] in ROLE_MAP
    ]
    return {"messages": messages}

ds_hermes = ds_hermes.map(format_hermes, remove_columns=ds_hermes.column_names)
ds_hermes = ds_hermes.filter(lambda x: len(x["messages"]) > 0)

ds = concatenate_datasets([ds_ultrachat, ds_nemotron, ds_code, ds_hermes])
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

recipe = GPTQModifier(
    config_groups={
        "group_0": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=32,
                symmetric=True,
                dynamic=False,
            ),
        )
    },
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*mlp\\.shared_expert$",
        "re:.*linear_attn.*",
        "re:.*input_layernorm$",
        "re:.*post_attention_layernorm$",
        "re:.*norm$",
        "re:.*gate$",
        "re:.*shared_expert$",
        "re:.*RMSNorm",
        "re:.*shared_expert\\.gate_proj$",
        "re:.*shared_expert\\.up_proj$",
        "re:.*shared_expert\\.down_proj$"
    ],
    bypass_divisibility_checks=False,
    block_size=128,
    dampening_frac=0.01,
    actorder="static",
    offload_hessians=False,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=True,
    pipeline="sequential",
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

cache_dir = snapshot_download(MODEL_ID, allow_patterns=["*.json"])

for filename in [
    "vocab.json",
    "tokenizer_config.json",
    "preprocessor_config.json",
    "video_preprocessor_config.json",
]:
    src = os.path.join(cache_dir, filename)
    dst = os.path.join(SAVE_DIR, filename)
    if os.path.exists(src):
        shutil.copyfile(src, dst)
        print(f"Copied: {filename}")
    else:
        print(f"Not found in cache: {filename}")

Many thanks for your time

UPDATE:
With the latest quant and the following client script:

import base64
import fitz
import json
import requests
import sys
import time
from pathlib import Path
from PIL import Image
from io import BytesIO
from typing import List, Optional, Literal
from pydantic import BaseModel

TARGET_DPI = 200
PYMUPDF_BASE_DPI = 200

VLLM_URL = "http://192.168.1.75/v1/chat/completions"
MODEL = "QWEN35"

class Vendor(BaseModel):
    name: Optional[str]
    address: Optional[str]
    cif_nif: Optional[str]
    confidence: Literal["confident", "partial", "ambiguous"]

class Client(BaseModel):
    name: Optional[str]
    address: Optional[str]
    cif_nif: Optional[str]
    confidence: Literal["confident", "partial", "ambiguous"]

class Dates(BaseModel):
    issue_date: Optional[str]
    due_date: Optional[str]

class Amounts(BaseModel):
    subtotal: Optional[float]
    total_discount: Optional[float]
    tax_rate: Optional[float]
    tax_amount: Optional[float]
    total: Optional[float]
    currency: Optional[str]

class LineItem(BaseModel):
    description: str
    quantity: Optional[float]
    unit_price: Optional[float]
    discount_percentage: Optional[float]
    discount_amount: Optional[float]
    total: Optional[float]
    confidence: Literal["confident", "partial", "ambiguous"]

class PaymentInfo(BaseModel):
    iban: Optional[str]
    payment_method: Optional[str]

class FieldWithIssue(BaseModel):
    field_name: str
    issue_type: Literal["not_found", "partial", "conflicting", "ambiguous"]
    notes: Optional[str]

class InvoiceExtraction(BaseModel):
    extraction_status: Literal["success", "partial", "failed"]
    failure_reason: Optional[str]
    invoice_number: Optional[str]
    vendor: Optional[Vendor]
    client: Optional[Client]
    dates: Optional[Dates]
    amounts: Optional[Amounts]
    line_items: Optional[List[LineItem]]
    payment_info: Optional[PaymentInfo]
    fields_with_issues: List[FieldWithIssue]

SYSTEM_PROMPT = """\
You are an expert invoice extraction system. Your sole task is to extract information EXPLICITLY visible in the document.

STRICT RULES:
1. LINE ITEM SEPARATION: Extract EVERY SINGLE ROW as a separate line item object. NEVER merge distinct products, services, or descriptions into a single line item. If a row only contains a description, extract it as a separate line item with null amounts.
2. NUMBER PARSING: Spanish/European formats use ',' for decimals and '.' for thousands.
   - A quantity of '1,000' is ALMOST ALWAYS 1 unit (1.0), NOT one thousand.
   - A price of '59,0000' is 59.0.
   - An amount of '1.260,00' is 1260.0.
   - Convert these strictly to standard JSON numbers (floats).
3. MATHEMATICAL VALIDATION: Verify that (Quantity * Unit Price) - Discount = Total for EACH line.
4. DISCOUNT HANDLING:
   - 'discount_percentage': Use this ONLY if there is a '%' sign or the column header explicitly indicates a percentage.
   - 'discount_amount': Use this ONLY if it is a direct monetary deduction or absolute value.

Respond ONLY with valid JSON conforming to the provided schema.
"""

USER_PROMPT_PREFIX = """\
Analyze the provided invoice images (which may span multiple pages) and extract all fields according to the strict JSON schema.
"""

def img_to_b64(img: Image.Image) -> str:
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

def pdf_to_images(pdf_path: Path, dpi: int = TARGET_DPI) -> list[Image.Image]:
    zoom = dpi / PYMUPDF_BASE_DPI
    mat = fitz.Matrix(zoom, zoom)
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(matrix=mat, colorspace=fitz.csRGB, alpha=False)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        pages.append(img)
    doc.close()
    return pages

def build_image_content(pages: list[Image.Image]) -> list[dict]:
    content = []
    for img in pages:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img_to_b64(img)}"},
        })
    content.append({"type": "text", "text": USER_PROMPT_PREFIX})
    return content

def call_vllm(payload: dict) -> dict:
    time.sleep(2)
    response = requests.post(
        VLLM_URL,
        json=payload,
        headers={"Content-Type": "application/json"},
    )
    response.raise_for_status()
    return response.json()

def extract_invoice(pdf_path: str) -> dict:
    pages = pdf_to_images(Path(pdf_path))
    content = build_image_content(pages)

    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": content},
        ],
        "temperature": 0.3,
        "top_k": 20,
        "min_p": 0.2,
        "top_p": 1.0,
       # "max_tokens": 21800,
        "chat_template_kwargs": {"enable_thinking": False},
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "invoice_extraction",
                "schema": InvoiceExtraction.model_json_schema(),
                "strict": True,
            },
        },
    }

    response_data = call_vllm(payload)
    message = response_data["choices"][0]["message"]

    content_str = message.get("content")
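    if content_str is None:
        # The server returned no message content, so there is nothing to parse
        raise ValueError("vLLM response contained no message content")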

    return json.loads(content_str)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        # Passing a string prints it to stderr and exits with status 1
        sys.exit(f"Usage: {sys.argv[0]} <invoice.pdf>")

    result = extract_invoice(sys.argv[1])
    print(json.dumps(result, indent=2, ensure_ascii=False))

I was able to successfully extract the invoice data
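
Rules 2 and 3 of the system prompt above can also be checked on the client side once the JSON comes back. The helpers below are a hypothetical sketch, not part of the original script: parse_european_number applies the decimal-comma convention from rule 2, and line_total_ok checks the arithmetic from rule 3 within a small tolerance chosen here for illustration.

```python
def parse_european_number(text: str) -> float:
    """Parse a number that uses '.' for thousands and ',' for decimals."""
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


def line_total_ok(quantity: float, unit_price: float,
                  discount_amount: float, total: float,
                  tol: float = 0.01) -> bool:
    """Check (quantity * unit_price) - discount == total, up to rounding."""
    return abs(quantity * unit_price - discount_amount - total) <= tol


# The three examples given in the system prompt
print(parse_european_number("1,000"))     # 1.0
print(parse_european_number("59,0000"))   # 59.0
print(parse_european_number("1.260,00"))  # 1260.0

# A line item with no discount and one with a 10.00 discount
print(line_total_ok(1.0, 59.0, 0.0, 59.0))
print(line_total_ok(2.0, 59.0, 10.0, 108.0))
```

Running such checks over line_items in the returned JSON gives a quick signal of whether a quantized model is inventing values, independent of whether the output is schema-valid.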

Sehyo · 2026-03-03T19:41:57Z

@Sehyo @dsikka @HDCharles

I have been investigating an issue that appears to affect all quantized models capable of running on vLLM main.

Specifically, all quantized versions fail to generate structured_output when the input includes images, regardless of the quantization technique used. This has been observed across the following formats:

LlmCompressor-AWQ

LlmCompressor-GPTQ

GPTQ

FP8

The failure occurs consistently during the generation phase when multimodal (image) data is present in the prompt.

https://huggingface.co/cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit https://huggingface.co/QuantTrio/Qwen3.5-35B-A3B-AWQ https://huggingface.co/btbtyler09/Qwen3.5-35B-A3B-GPTQ-4bit

The last quant code:

import os
import shutil

import torch
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from datasets import concatenate_datasets, load_dataset
from huggingface_hub import snapshot_download
from transformers import AutoProcessor, AutoTokenizer, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen3.5-35B-A3B"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-GPTQ-W4A16-G32"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map=None,
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

samples_per_dataset = NUM_CALIBRATION_SAMPLES // 4

ds_ultrachat = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{samples_per_dataset}]",
)

ds_nemotron = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v2",
    split=f"chat[:{samples_per_dataset}]",
)

ds_code = load_dataset(
    "ise-uiuc/Magicoder-Evol-Instruct-110K",
    split=f"train[:{samples_per_dataset}]",
)

ds_hermes = load_dataset(
    "NousResearch/hermes-function-calling-v1",
    split=f"train[:{samples_per_dataset}]",
)

ds_ultrachat = ds_ultrachat.select_columns(["messages"])
ds_nemotron = ds_nemotron.select_columns(["messages"])

def format_magicoder(example):
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

ds_code = ds_code.map(format_magicoder, remove_columns=ds_code.column_names)

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def format_hermes(example):
    messages = [
        {"role": ROLE_MAP[t["from"]], "content": t["value"]}
        for t in example["conversations"]
        if t["from"] in ROLE_MAP
    ]
    return {"messages": messages}

ds_hermes = ds_hermes.map(format_hermes, remove_columns=ds_hermes.column_names)
ds_hermes = ds_hermes.filter(lambda x: len(x["messages"]) > 0)

ds = concatenate_datasets([ds_ultrachat, ds_nemotron, ds_code, ds_hermes])
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

recipe = GPTQModifier(
    config_groups={
        "group_0": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=32,
                symmetric=True,
                dynamic=False,
            ),
        )
    },
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
        "re:.*embed_tokens$",
        "re:.*shared_expert_gate$",
        "re:.*mlp\\.shared_expert$",
        "re:.*linear_attn.*",
        "re:.*input_layernorm$",
        "re:.*post_attention_layernorm$",
        "re:.*norm$",
        "re:.*gate$",
        "re:.*shared_expert$",
        "re:.*RMSNorm",
        "re:.*shared_expert\\.gate_proj$",
        "re:.*shared_expert\\.up_proj$",
        "re:.*shared_expert\\.down_proj$"
    ],
    bypass_divisibility_checks=False,
    block_size=128,
    dampening_frac=0.01,
    actorder="static",
    offload_hessians=False,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=True,
    pipeline="sequential",
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

cache_dir = snapshot_download(MODEL_ID, allow_patterns=["*.json"])

for filename in [
    "vocab.json",
    "tokenizer_config.json",
    "preprocessor_config.json",
    "video_preprocessor_config.json",
]:
    src = os.path.join(cache_dir, filename)
    dst = os.path.join(SAVE_DIR, filename)
    if os.path.exists(src):
        shutil.copyfile(src, dst)
        print(f"Copied: {filename}")
    else:
        print(f"Not found in cache: {filename}")

Many thanks for your time

UPDATE: With the latest quant and the following script:

import base64
import fitz
import json
import requests
import sys
import time
from pathlib import Path
from PIL import Image
from io import BytesIO
from typing import List, Optional, Literal
from pydantic import BaseModel

TARGET_DPI = 200
PYMUPDF_BASE_DPI = 200

VLLM_URL = "http://192.168.1.75/v1/chat/completions"
MODEL = "QWEN35"

class Vendor(BaseModel):
    name: Optional[str]
    address: Optional[str]
    cif_nif: Optional[str]
    confidence: Literal["confident", "partial", "ambiguous"]

class Client(BaseModel):
    name: Optional[str]
    address: Optional[str]
    cif_nif: Optional[str]
    confidence: Literal["confident", "partial", "ambiguous"]

class Dates(BaseModel):
    issue_date: Optional[str]
    due_date: Optional[str]

class Amounts(BaseModel):
    subtotal: Optional[float]
    total_discount: Optional[float]
    tax_rate: Optional[float]
    tax_amount: Optional[float]
    total: Optional[float]
    currency: Optional[str]

class LineItem(BaseModel):
    description: str
    quantity: Optional[float]
    unit_price: Optional[float]
    discount_percentage: Optional[float]
    discount_amount: Optional[float]
    total: Optional[float]
    confidence: Literal["confident", "partial", "ambiguous"]

class PaymentInfo(BaseModel):
    iban: Optional[str]
    payment_method: Optional[str]

class FieldWithIssue(BaseModel):
    field_name: str
    issue_type: Literal["not_found", "partial", "conflicting", "ambiguous"]
    notes: Optional[str]

class InvoiceExtraction(BaseModel):
    extraction_status: Literal["success", "partial", "failed"]
    failure_reason: Optional[str]
    invoice_number: Optional[str]
    vendor: Optional[Vendor]
    client: Optional[Client]
    dates: Optional[Dates]
    amounts: Optional[Amounts]
    line_items: Optional[List[LineItem]]
    payment_info: Optional[PaymentInfo]
    fields_with_issues: List[FieldWithIssue]

SYSTEM_PROMPT = """\
You are an expert invoice extraction system. Your sole task is to extract information EXPLICITLY visible in the document.

STRICT RULES:
1. LINE ITEM SEPARATION: Extract EVERY SINGLE ROW as a separate line item object. NEVER merge distinct products, services, or descriptions into a single line item. If a row only contains a description, extract it as a separate line item with null amounts.
2. NUMBER PARSING: Spanish/European formats use ',' for decimals and '.' for thousands.
   - A quantity of '1,000' is ALMOST ALWAYS 1 unit (1.0), NOT one thousand.
   - A price of '59,0000' is 59.0.
   - An amount of '1.260,00' is 1260.0.
   - Convert these strictly to standard JSON numbers (floats).
3. MATHEMATICAL VALIDATION: Verify that (Quantity * Unit Price) - Discount = Total for EACH line.
4. DISCOUNT HANDLING:
   - 'discount_percentage': Use this ONLY if there is a '%' sign or the column header explicitly indicates a percentage.
   - 'discount_amount': Use this ONLY if it is a direct monetary deduction or absolute value.

Respond ONLY with valid JSON conforming to the provided schema.
"""

USER_PROMPT_PREFIX = """\
Analyze the provided invoice images (which may span multiple pages) and extract all fields according to the strict JSON schema.
"""

def img_to_b64(img: Image.Image) -> str:
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

def pdf_to_images(pdf_path: Path, dpi: int = TARGET_DPI) -> list[Image.Image]:
    zoom = dpi / PYMUPDF_BASE_DPI
    mat = fitz.Matrix(zoom, zoom)
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(matrix=mat, colorspace=fitz.csRGB, alpha=False)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        pages.append(img)
    doc.close()
    return pages

def build_image_content(pages: list[Image.Image]) -> list[dict]:
    content = []
    for img in pages:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img_to_b64(img)}"},
        })
    content.append({"type": "text", "text": USER_PROMPT_PREFIX})
    return content

def call_vllm(payload: dict) -> dict:
    time.sleep(2)
    response = requests.post(
        VLLM_URL,
        json=payload,
        headers={"Content-Type": "application/json"},
    )
    response.raise_for_status()
    return response.json()

def extract_invoice(pdf_path: str) -> dict:
    pages = pdf_to_images(Path(pdf_path))
    content = build_image_content(pages)

    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": content},
        ],
        "temperature": 0.3,
        "top_k": 20,
        "min_p": 0.2,
        "top_p": 1.0,
        # "max_tokens": 21800,
        "chat_template_kwargs": {"enable_thinking": False},
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "invoice_extraction",
                "schema": InvoiceExtraction.model_json_schema(),
                "strict": True,
            },
        },
    }

    response_data = call_vllm(payload)
    message = response_data["choices"][0]["message"]

    content_str = message.get("content")
    if content_str is None:
        raise ValueError("vLLM response contained no message content")

    return json.loads(content_str)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(f"Usage: {sys.argv[0]} <invoice.pdf>")

    result = extract_invoice(sys.argv[1])
    print(json.dumps(result, indent=2, ensure_ascii=False))

I was able to successfully extract the invoice data.
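
The number-parsing and validation rules in the system prompt (rules 2 and 3) can also be checked on the client side once the JSON comes back. A minimal sketch, assuming raw strings in the Spanish/European format the prompt describes; `parse_european_number` and `line_total_matches` are hypothetical helper names and are not part of the script above:

```python
def parse_european_number(text: str) -> float:
    """Parse a number that uses '.' for thousands and ',' for decimals."""
    # Drop thousands separators first, then turn the decimal comma into a point.
    return float(text.strip().replace(".", "").replace(",", "."))


def line_total_matches(quantity: float, unit_price: float, discount: float,
                       total: float, tolerance: float = 0.01) -> bool:
    """Rule 3: (quantity * unit_price) - discount should equal the line total."""
    return abs(quantity * unit_price - discount - total) <= tolerance


print(parse_european_number("1.260,00"))         # 1260.0
print(parse_european_number("59,0000"))          # 59.0
print(parse_european_number("1,000"))            # 1.0
print(line_total_matches(1.0, 59.0, 0.0, 59.0))  # True
```

The tolerance is an assumed value to absorb rounding in printed totals; it would need tuning for real invoices.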

Hello! I have uploaded NVFP4 quants made with this PR to Hugging Face.
My 122B quant is the most downloaded NVFP4 one on Hugging Face, with over 70K downloads:
https://huggingface.co/Sehyo/Qwen3.5-122B-A10B-NVFP4

I haven't seen any performance or accuracy issues myself, but if you have the time you can test with my model from that link, which was made with this PR.

JartX · 2026-03-03T20:04:04Z

@Sehyo Your model works well with the images, unlike the others. The structured output was missing some data. After using the instruction dataset with JSON, as recommended by the other user, it worked correctly. I also want to mention that your PR and the moe_calibrate_all_experts=True setting make a big difference compared to the other models. I'm using ROCm on RDNA3 and wasn't able to test it until this afternoon on RunPod. RDNA3 doesn't support NVFP4, so I needed to quantize with llmcompressor using GPTQ or AWQ. Everything works perfectly for me :) Could you review my latest quantization script and perhaps use it as a GPTQ example in your PR for those who can't run NVFP4, FP8, etc.? Thank you very much!

JartX · 2026-03-03T20:06:11Z

@Sehyo I would switch it to the W4A16 scheme; the group size is set so that it works with ExLlama on my RDNA3.
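
For readers unfamiliar with the scheme being discussed: W4A16 with group size 32 means each run of 32 weights shares one scale and is stored as signed 4-bit integers, while activations stay in 16-bit. The following is a pure-Python sketch of that idea only. The scale rule (`max_abs / 7`) and the helper names are assumptions for illustration, not llm-compressor's actual implementation, and the sketch leaves out GPTQ's error compensation entirely.

```python
GROUP_SIZE = 32          # matches group_size=32 in the recipe
Q_MIN, Q_MAX = -8, 7     # signed 4-bit integer range


def quantize_group(weights):
    """Symmetric quantization of one group: returns (int codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / Q_MAX if max_abs > 0 else 1.0  # assumed scale rule
    codes = [min(Q_MAX, max(Q_MIN, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row):
    """Split a weight row into groups of GROUP_SIZE and quantize each one."""
    if len(row) % GROUP_SIZE != 0:
        raise ValueError("row length must be divisible by the group size")
    return [quantize_group(row[i:i + GROUP_SIZE])
            for i in range(0, len(row), GROUP_SIZE)]


def dequantize_row(groups):
    """Rebuild approximate weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


row = [((i * 37) % 101 - 50) / 100.0 for i in range(64)]  # toy weights
groups = quantize_row(row)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), max_error)
```

Smaller groups mean more scales to store but a tighter fit per group, which is the trade-off behind picking 32 over a larger group size.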

BenasdTW · 2026-03-03T22:24:29Z

@Sehyo Hi, I’ve encountered a couple of issues while running a modified version of your example code.

Modification to the quantization script:

scheme_0 = FP8_DYNAMIC
scheme_0["targets"] = ["re:.*self_attn.o_proj", "re:.*linear_attn.in_proj_qkv", "re:.*linear_attn.in_proj_z", "re:.*linear_attn.out_proj"]
scheme_1 = NVFP4
scheme_1["targets"] = ["re:.*self_attn.(q|k|v)_proj", "re:.*mlp.experts.*.*_proj"]

ignore = ["re:.*lm_head", "re:visual.*", "re:model.visual.*", "re:.*mlp.gate$", "re:.*norm.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$", "re:.*mtp.*", "re:.*conv1d.*", "re:.*in_proj_a*", "re:.*in_proj_b*", "re:.*in_proj_c*"]
recipe = QuantizationModifier(
    config_groups={"group_0": scheme_0, "group_1": scheme_1}, ignore=ignore
)

Expected behavior: self_attn.o_proj, linear_attn.in_proj_qkv and linear_attn.in_proj_z should be quantized to FP8, while self_attn.(q|k|v)_proj and mlp should be quantized to NVFP4.

Result: Only self_attn.o_proj is quantized to FP8; linear_attn.in_proj_qkv and linear_attn.in_proj_z remain unquantized. The NVFP4 quantization works as expected.

Another issue: the exported tokenizer metadata appears to use an unexpected class:

"tokenizer_class": "TokenizersBackend",

phaelon74 · 2026-03-04T03:18:17Z

There was, and still may be, an issue using mixed precision with NVFP4 in vLLM. Be aware of that, as it may be what is occurring here.

I closed my PR, as I didn't see yours, @Sehyo. Your code was very close to mine, and your MTP handling is solid for peeps who turn it on. Thanks for submitting this.

BenasdTW · 2026-03-04T04:57:11Z

There was, and still may be, an issue using mixed precision with NVFP4 in vLLM. Be aware of that, as it may be what is occurring here.

@phaelon74 Thanks for the information! I’ll open a separate issue to discuss this, since it seems unrelated to this PR. I wonder if this is specific to the new linear_attn module, because self_attn.o_proj is being quantized correctly.

Edit: I found the issue. Turns out the regex wasn’t matching in my script.
Fixed version:

scheme_0 = FP8_DYNAMIC
scheme_0["targets"] = [
    "re:.*self_attn.o_proj$",
    "re:.*linear_attn.in_proj_qkv$",
    "re:.*linear_attn.in_proj_z$",
    "re:.*linear_attn.out_proj$",
]
scheme_1 = NVFP4
scheme_1["targets"] = [
    "re:.*self_attn.(q|k|v)_proj$",
    "re:.*mlp.experts.*.*_proj$",
]
ignore = ["re:.*lm_head", "re:visual.*", "re:model.visual.*", "re:.*mlp.gate$", "re:.*norm.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$", "re:.*mtp.*", "re:.*conv1d.*", "re:.*in_proj_a+", "re:.*in_proj_b+", "re:.*in_proj_c+"]
recipe = QuantizationModifier(
    config_groups={"group_0": scheme_0, "group_1": scheme_1}, ignore=ignore
)
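
The difference between the two versions comes down to `*` versus `+` in the ignore patterns. A small sketch with Python's `re` module, assuming the matcher applies `re.match` to the text after the `re:` prefix (so patterns are anchored at the start of the module name but not at the end). The module names below are illustrative and were not read from the model:

```python
import re

names = [
    "model.layers.0.linear_attn.in_proj_qkv",
    "model.layers.0.linear_attn.in_proj_z",
    "model.layers.0.linear_attn.in_proj_a",
]


def matched(pattern: str) -> list[str]:
    """Return the names that a 're:' style pattern would match."""
    return [name for name in names if re.match(pattern, name)]


# 'a*' allows zero occurrences of 'a', so the pattern matches any name that
# contains 'in_proj_' and sweeps in_proj_qkv and in_proj_z into the ignore list.
print(matched(r".*in_proj_a*"))

# 'a+' requires at least one 'a', so only in_proj_a is ignored.
print(matched(r".*in_proj_a+"))
```

If ignore entries take precedence over targets, this would explain why in_proj_qkv and in_proj_z stayed unquantized with the first version of the script.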

dsikka

Overall this looks fine, but I don't quite understand why we need an updated regex pattern, _update_config_expanded_ignore, or _graft_extra_weights. I think generally, if we want to expand regex mapping, that should be done in a follow-up PR, as it is not specific to Qwen3.5.

I am able to generate quantized checkpoints without this

dsikka · 2026-03-05T20:56:30Z

src/llmcompressor/modifiers/quantization/quantization/base.py

+        # by regex (e.g. MoE router modules that aren't nn.Linear).
+        # Store expanded names on the model so the save wrapper can ensure
+        # they appear in config.json.
+        regex_patterns = [p for p in self.ignore if p.startswith("re:")]


Can you explain why you need this?

I did not have this in mine, and mine quantized and loaded successfully in vLLM, so I would love to know as well.

@Sehyo can you explain why this is required?

mergify · 2026-03-05T21:00:52Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Sehyo · 2026-03-07T08:05:57Z

Overall this looks fine, but I don't quite understand why we need an updated regex pattern, _update_config_expanded_ignore, or _graft_extra_weights. I think generally, if we want to expand regex mapping, that should be done in a follow-up PR, as it is not specific to Qwen3.5.

I am able to generate quantized checkpoints without this

Graft extra weights is for re-adding MTP weights back in as they get dropped.

dsikka · 2026-03-07T14:50:17Z

Overall this looks fine, but I don't quite understand why we need an updated regex pattern, _update_config_expanded_ignore, or _graft_extra_weights. I think generally, if we want to expand regex mapping, that should be done in a follow-up PR, as it is not specific to Qwen3.5.
I am able to generate quantized checkpoints without this

Graft extra weights is for re-adding MTP weights back in as they get dropped.

@Sehyo I think we want to do this at the end when we're saving the checkpoint, not in the middle of calibration as it does not impact quantization.

Do you mind also resolving the quality issues?

mergify · 2026-03-07T14:53:25Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

paulplay-pm · 2026-03-10T03:57:25Z

Hi @Sehyo, could you try using structured_outputs (JSON) with the model?

The script I use to extract the JSON data:

import base64
import fitz
import json
import requests
import sys
import time
from pathlib import Path
from PIL import Image
from io import BytesIO
from typing import List, Optional, Literal
from pydantic import BaseModel
from qwen_vl_utils import smart_resize

TARGET_DPI = 150
PYMUPDF_BASE_DPI = 72
PATCH_MULTIPLE = 32 
MAX_SIDE_PX = 3000

VLLM_URL = "http://192.168.1.75/v1/chat/completions"
MODEL = "QWEN3.5"

class Vendor(BaseModel):
    name: Optional[str]
    address: Optional[str]
    cif_nif: Optional[str]
    confidence: Literal["confident", "partial", "ambiguous"]

class Client(BaseModel):
    name: Optional[str]
    address: Optional[str]
    cif_nif: Optional[str]
    confidence: Literal["confident", "partial", "ambiguous"]

class Dates(BaseModel):
    issue_date: Optional[str]
    due_date: Optional[str]

class Amounts(BaseModel):
    subtotal: Optional[float]
    total_discount: Optional[float]
    tax_rate: Optional[float]
    tax_amount: Optional[float]
    total: Optional[float]
    currency: Optional[str]

class LineItem(BaseModel):
    description: str
    quantity: Optional[float]
    unit_price: Optional[float]
    discount_percentage: Optional[float]
    discount_amount: Optional[float]
    total: Optional[float]
    confidence: Literal["confident", "partial", "ambiguous"]

class PaymentInfo(BaseModel):
    iban: Optional[str]
    payment_method: Optional[str]

class FieldWithIssue(BaseModel):
    field_name: str
    issue_type: Literal["not_found", "partial", "conflicting", "ambiguous"]
    notes: Optional[str]

class InvoiceExtraction(BaseModel):
    extraction_status: Literal["success", "partial", "failed"]
    failure_reason: Optional[str]
    invoice_number: Optional[str]
    vendor: Optional[Vendor]
    client: Optional[Client]
    dates: Optional[Dates]
    amounts: Optional[Amounts]
    line_items: Optional[List[LineItem]]
    payment_info: Optional[PaymentInfo]
    fields_with_issues: List[FieldWithIssue]

SYSTEM_PROMPT = """\
You are an expert invoice extraction system. Your sole task is to extract information EXPLICITLY visible in the document.

STRICT RULES:
1. LINE ITEM SEPARATION: Extract EVERY SINGLE ROW as a separate line item object. NEVER merge distinct products, services, or descriptions into a single line item. If a row only contains a description, extract it as a separate line item with null amounts.
2. NUMBER PARSING: Spanish/European formats use ',' for decimals and '.' for thousands.
   - A quantity of '1,000' is ALMOST ALWAYS 1 unit (1.0), NOT one thousand.
   - A price of '59,0000' is 59.0.
   - An amount of '1.260,00' is 1260.0.
   - Convert these strictly to standard JSON numbers (floats).
3. MATHEMATICAL VALIDATION: Verify that (Quantity * Unit Price) - Discount = Total for EACH line.
4. DISCOUNT HANDLING:
   - 'discount_percentage': Use this ONLY if there is a '%' sign or the column header explicitly indicates a percentage.
   - 'discount_amount': Use this ONLY if it is a direct monetary deduction or absolute value.

Respond ONLY with valid JSON conforming to the provided schema.
"""

USER_PROMPT_PREFIX = """\
Analyze the provided invoice images (which may span multiple pages) and extract all fields according to the strict JSON schema.
"""

def pad_to_multiple(img: Image.Image, multiple: int = PATCH_MULTIPLE) -> Image.Image:
    w, h = img.size
    new_w = ((w + multiple - 1) // multiple) * multiple
    new_h = ((h + multiple - 1) // multiple) * multiple
    if new_w == w and new_h == h:
        return img
    canvas = Image.new("RGB", (new_w, new_h), (255, 255, 255))
    canvas.paste(img, (0, 0))
    return canvas

def clamp_resolution(img: Image.Image, max_side: int = MAX_SIDE_PX) -> Image.Image:
    w, h = img.size
    if max(w, h) <= max_side:
        return img
    scale = max_side / max(w, h)
    return img.resize((int(w * scale), int(h * scale)), Image.Resampling.LANCZOS)

def img_to_b64(img: Image.Image) -> str:
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

def pdf_to_images(pdf_path: Path, dpi: int = TARGET_DPI) -> list[Image.Image]:
    zoom = dpi / PYMUPDF_BASE_DPI
    mat = fitz.Matrix(zoom, zoom)
    doc = fitz.open(pdf_path)
    pages = []
    for page in doc:
        pix = page.get_pixmap(matrix=mat, colorspace=fitz.csRGB, alpha=False)
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        img = clamp_resolution(img)
        pages.append(img)
    doc.close()
    return pages

def build_image_content(pages: list[Image.Image]) -> list[dict]:
    content = []
    min_pixels = 512 * 28 * 28
    max_pixels = 4608 * 28 * 28
    
    for img in pages:
        w, h = img.size
        new_h, new_w = smart_resize(h, w, min_pixels=min_pixels, max_pixels=max_pixels, factor=28)
        resized_img = img.resize((new_w, new_h), Image.Resampling.LANCZOS)
        padded_img = pad_to_multiple(resized_img, multiple=28)
        
        content.append(
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{img_to_b64(padded_img)}"},
            }
        )
    content.append({"type": "text", "text": USER_PROMPT_PREFIX})
    return content

def call_vllm(payload: dict) -> dict:
    time.sleep(2)
    response = requests.post(
        VLLM_URL,
        json=payload,
        headers={"Content-Type": "application/json"},
    )
    response.raise_for_status()
    return response.json()

def extract_invoice(pdf_path: str) -> dict:
    pages = pdf_to_images(Path(pdf_path))
    content = build_image_content(pages)

    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": content},
        ],
        "temperature": 0.3,
        # "top_k": 20,
        # "min_p": 0.0,
        # "top_p": 0.8,
        # "presence_penalty": 1.5,
        # "repetition_penalty": 1.0,
        "chat_template_kwargs": {"enable_thinking": False},
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "invoice_extraction",
                "schema": InvoiceExtraction.model_json_schema(),
                "strict": True,
            },
        },
    }

    response_data = call_vllm(payload)
    message = response_data["choices"][0]["message"]

    content_str = message.get("content")
    if content_str is None:
        raise ValueError("vLLM response contained no message content")

    return json.loads(content_str)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(f"Usage: {sys.argv[0]} <invoice.pdf>")

    result = extract_invoice(sys.argv[1])
    print(json.dumps(result, indent=2, ensure_ascii=False))
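
The image helpers in the script above are mostly integer arithmetic. A standalone sketch of the two calculations, with example sizes chosen for illustration (they are not taken from a real invoice); `zoom_for_dpi` and `round_up_to_multiple` are hypothetical names that mirror what `pdf_to_images` and `pad_to_multiple` compute:

```python
PYMUPDF_BASE_DPI = 72  # PDF user space is 72 points per inch


def zoom_for_dpi(target_dpi: int) -> float:
    """Scale factor passed to fitz.Matrix to render at target_dpi."""
    return target_dpi / PYMUPDF_BASE_DPI


def round_up_to_multiple(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value (ceiling division)."""
    return ((value + multiple - 1) // multiple) * multiple


print(zoom_for_dpi(144))               # 2.0
print(round_up_to_multiple(1241, 28))  # 1260
print(round_up_to_multiple(1260, 28))  # 1260
```

A side that is already a multiple of 28 is left unchanged, which is why `pad_to_multiple` returns the input image when both dimensions already line up.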

This is my quant script, based on your branch/PR but using GPTQ.

import os
import shutil

import torch
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
from datasets import concatenate_datasets, load_dataset
from huggingface_hub import snapshot_download
from transformers import AutoProcessor, AutoTokenizer, Qwen3_5MoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen3.5-35B-A3B"
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-GPTQ-W4A16-G32"

NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 2048

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map=None,
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

samples_per_dataset = NUM_CALIBRATION_SAMPLES // 2

ds_ultrachat = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{samples_per_dataset}]",
)
ds_nemotron = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v2",
    split=f"chat[:{samples_per_dataset}]",
)

ds_ultrachat = ds_ultrachat.select_columns(["messages"])
ds_nemotron = ds_nemotron.select_columns(["messages"])
ds = concatenate_datasets([ds_ultrachat, ds_nemotron])
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

recipe = GPTQModifier(
    config_groups={
        "group_0": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=32,
                symmetric=True,
                dynamic=False,
            ),
        )
    },
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.*",
        "re:.*visual.*",
    ],
    bypass_divisibility_checks=False,
    block_size=128,
    dampening_frac=0.01,
    actorder="static",
    offload_hessians=False,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    moe_calibrate_all_experts=True,
    pipeline="sequential",
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)

cache_dir = snapshot_download(MODEL_ID, allow_patterns=["*.json"])

for filename in [
    "vocab.json",
    "preprocessor_config.json",
    "video_preprocessor_config.json",
    "tokenizer_config.json",
]:
    src = os.path.join(cache_dir, filename)
    dst = os.path.join(SAVE_DIR, filename)
    if os.path.exists(src):
        shutil.copyfile(src, dst)
        print(f"Copied: {filename}")
    else:
        print(f"Not Found in Cache: {filename}")

The issue I'm seeing is that the model often fails to respond or hallucinates data when I request Structured Outputs (JSON). However, when I send the request without forcing a structured format, the responses make much more sense. Interestingly, models like Qwen3VL 30B A3B Instruct handle this correctly.

Tested using the vLLM main branch.

Many thanks for your time.

I noticed your code uses from transformers import AutoProcessor, AutoTokenizer, Qwen3_5MoeForConditionalGeneration, which requires transformers>=5.2.0. However, the latest version of llmcompressor (needed for from llmcompressor import oneshot) depends on transformers>=4.56.1,<=4.57.6.
Could you please advise on how to resolve this transformers version conflict? Which version of transformers are you currently using? Thank you.
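
To make the conflict concrete: the two requirements quoted above have no version in common. A small standard-library sketch that uses only the bounds stated in this comment; `parse_version` and `in_range` are hypothetical helpers and ignore pre-release suffixes:

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '4.57.6' into (4, 57, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def in_range(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version <= upper."""
    return parse_version(lower) <= parse_version(version) <= parse_version(upper)


# Range quoted for llmcompressor: transformers>=4.56.1,<=4.57.6
print(in_range("4.57.6", "4.56.1", "4.57.6"))  # True
# Minimum quoted for Qwen3_5MoeForConditionalGeneration: 5.2.0
print(in_range("5.2.0", "4.56.1", "4.57.6"))   # False
```

To see which version is actually installed, `importlib.metadata.version("transformers")` returns it as a string.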

dsikka · 2026-03-11T18:50:40Z

Hi @Sehyo I am going to break this PR and land it in smaller pieces as some of this functionality is now out of date.

Thank you for the contribution!

phaelon74 · 2026-03-11T19:29:49Z

Hi @Sehyo I am going to break this PR and land it in smaller pieces as some of this functionality is now out of date.

Thank you for the contribution!

Apologies for this ask, @dsikka, but can you map it out please? I am having to use my own PR to make my Qwen3.5 quants work, so it would be nice to know which PRs you plan to land this in, so I know when they land, etc.

Sehyo requested review from dsikka and kylesayrs as code owners February 18, 2026 11:12

mergify bot added the documentation label Feb 18, 2026

gemini-code-assist bot reviewed Feb 18, 2026

src/llmcompressor/modeling/qwen3_5_moe.py (outdated)

Sehyo force-pushed the feat/qwen3-5-moe-calibration branch from 83c7bd8 to 1d428f9 February 18, 2026 11:14

dsikka reviewed Feb 18, 2026

dsikka added the ready, qwen, and nvfp4 labels Feb 18, 2026

mergify bot added the quality-failed label Feb 18, 2026

mergify bot removed the quality-failed label Feb 20, 2026

Sehyo force-pushed the feat/qwen3-5-moe-calibration branch 2 times, most recently from 642ba83 to d030961 February 20, 2026 09:51

Sehyo requested review from HDCharles and brian-dellabetta as code owners February 20, 2026 09:51

Merge branch 'main' into feat/qwen3-5-moe-calibration

28414db

fix: address PR review feedback from HDCharles

dcfdce2

- Remove unnecessary disable_onloading() wrapper in qwen3_5_moe.py
- Add hasattr fallback for _no_split_modules in get_no_split_params
- Use public match_named_modules API instead of private _match_name

dsikka mentioned this pull request Mar 4, 2026

Add Qwen3.5 MoE Modeling #2436

Closed

dsikka reviewed Mar 5, 2026

Merge branch 'main' into feat/qwen3-5-moe-calibration

0fcf3ed

mergify bot added the quality-failed label Mar 5, 2026

Merge branch 'main' into feat/qwen3-5-moe-calibration

6ec9128

mergify bot removed the quality-failed label Mar 7, 2026

mergify bot added the quality-failed label Mar 7, 2026

Copilot AI mentioned this pull request Mar 9, 2026

feat: add Qwen3.5 MoE calibration module and compatibility patches groxaxo/llm-compressor#1

Merged

dsikka self-assigned this Mar 11, 2026

2imi9 mentioned this pull request Mar 12, 2026

[Examples] Add Qwen3.5-27B NVFP4A16 and MXFP4A16 quantization examples #2467

Open

3 tasks
